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Central to the gravitational wave detection problem is the challenge of separating features in the 
data produced by astrophysical sources from features produced by the detector. Matched filter- 
ing provides an optimal solution for Gaussian noise, but in practice, transient noise excursions or 
"glitches" complicate the analysis. Detector diagnostics and coincidence tests can be used to veto 
many glitches which may otherwise be misinterpreted as gravitational wave signals. The glitches 
that remain can lead to long tails in the matched filter search statistics and drive up the detec- 
tion threshold. Here we describe a Bayesian approach that incorporates a more realistic model for 
the instrument noise allowing for fluctuating noise levels that vary independently across frequency 
bands, and deterministic "glitch fitting" using wavelets as "glitch templates" , the number of which 
is determined by a trans-dimensional Markov chain Monte Carlo algorithm. We demonstrate the 
method's effectiveness on simulated data containing low amplitude gravitational wave signals from 
inspiraling binary black hole systems, and simulated non-stationary and non-Gaussian noise com- 
prised of a Gaussian component with the standard LIGO/ Virgo spectrum, and injected glitches of 
various amplitude, prevalence, and variety. Glitch fitting allows us to detect significantly weaker 
signals than standard techniques. 



I. INTRODUCTION 

To take full advantage of gravitational wave detec- 
tors we must confidently distinguish weak gravitational 
wave (GW) signals from detector noise. This challenge, 
dubbed the "detection problem," warrants significant at- 
tention as we prepare for the first direct measurement 
of gravitational waves within the next decade. In their 
current operational configuration, the Laser Interferom- 
eter Gravitational wave Observatory (LIGO) Q and 
Virgo 0, encounter frequent large-amplitude transient 
noise events, or "glitches", while detectable GW signals 
are forecast to be rare and weak. The LIGO- Virgo col- 
laboration has developed several practical techniques for 
addressing the detection problem in this context (see e.g. 
Refs. [3, |j|), but there is considerable room for improve- 
ment. 

We propose a new approach to the detection problem 
that works by simultaneously modeling the instrument 
noise and the gravitational wave signals. It can be argued 
that the existing search algorithms incorporate implicit 
noise modeling as the search statistics are tuned using 
time slides of the data (the time shifts destroy the coher- 
ence of the gravitational wave signals while preserving the 
noise properties in a statistical sense). In our Bayesian 
approach to the detection problem we must use an ex- 
plicit noise model, which in turn defines the likelihood 
function. We wish to stress that the issue is not whether 
a Bayesian or frequentist approach is superior, but rather 
that the "tuning" of the search must proceed in a differ- 
ent fashion. In the frequentist approach, one or more 
statistics are chosen that (hopefully) indicate whether a 



gravitational wave signal is present in the data. Time 
slides and signal injections are then used to produce esti- 
mates of the false alarm and false dismissal probabilities 
for the statistics. The choice of statistics can be refined 
during this procedure in an effort to minimize the false 
dismissals and false alarms. In a Bayesian approach the 
analysis is fully determined by the choice of likelihood 
function and prior. Once these have been chosen the 
analysis is purely mechanical, there are no thresholds to 
set or statistics to tune. As Laplace put it "the theory of 
probabilities is basically just common sense reduced to 
calculus" Though the well defined probability calcu- 
lus of Bayesian analysis is appealing, the output is only 
as good as the input. Bayesian analyses of LIGO- Virgo 
data that assume a Gaussian likelihood function 
should be treated with caution. 

Rather than making guesses about the noise model, 
and hence the form of the likelihood function, Bayesian 
inference can be used to determine the noise model from 
the data @. To this end, we introduce several parameter- 
ized models for the noise, and use the data to jointly es- 
timate the noise and signal parameters. Bayesian model 
selection is applied to alternative parameterizations of 
the noise in an effort to find the most parsimonious repre- 
sentation. The process of designing these parameterized 
likelihood models is similar to the process of designing 
statistics for a frequentist analysis. The main difference 
is that the "tuning" of the likelihood models occurs me- 
chanically, without fear of operator induced biases. 

We consider a variety of noise models, and in tests per- 
formed on simulated LIGO/ Virgo noise, we find that a 
model that combines a description of Gaussian noise with 
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a time varying power spectrum and a wavelet model that 
is able to fit semi-coherent glitches best represents the 
data. While we were careful to ensure that the models 
used to generate the simulated data did not correspond 
exactly to the models being tested, it is certainly no sur- 
prise that the model favored by the analysis is the one 
that is most similar to the model used to simulate the 
data. Indeed, the main motivation for working with sim- 
ulated data is to have full control of the experiment - if 
the analysis had not selected the likelihood model that 
is closest to the injected noise model we would know 
that something was wrong. The next step, which we 
are actively pursuing, is to repeat our analysis using real 
LIGO/ Virgo data. 

We are not alone in promoting the idea that better 
noise modeling is key to t he g ravitational wave detection 
problem. Allen et al. 0, Il0j | have argued that parame- 
terized noise models are needed to derive robust search 
statistics. Clark et al. included a model for Sine- 
Gaussian instrument glitches as an alternative hypoth- 
esis when computing Bayesian odds ratios for gravita- 
tional wave signals from pulsar glitches. Clark et al. also 
suggested that it would be valuable to have a classifi- 
cation of instrument glitches that could be used to con- 
struct better models for the instrument noise. Similar 
"glitch hypotheses" were considered by Veitch and Vec- 
chio [lH in a study of Bayesian model selection applied to 
the search for black hole inspiral signals. The possibility 
of subtracting instrument glitches from the data using 
a physical model of the detector has been investigated 
by Ajith et al. [U, Q. Principe and Pinto have 
introduced a physically motivated model for the glitch 
contribution to the instrument noise, and have used this 
model to derive what they refer to as a "locally optimum 
network detection statistic" [l6j]. In their approach the 
glitches are treated in a statistical sense, while our ap- 
proach directly models the glitches present in the data. 
The possibility of directly deriving likelihood functions 
from the data in a Bayesian setting has previously been 
considered by Cannon [17j . In Cannon's approach the 
data is first reduced to an n-tuple of quantities, such 
as the parameters produced by a matched filter search, 
then using time slides and signal injections the likelihood 
distributions for the signal and noise hypotheses are di- 
rectly estimated from the data. These are then used to 
estimate the posterior probability that a measured set of 
parameters corresponds to a gravitational wave event. 

We develop our approach as follows: In section [TT] we 
discuss the connection between noise models and likeli- 
hood functions, and in scction Hni we describe the wavelet 
representation that we use to define our noise models. 
Section IIVI provides a brief review of the Markov Chain 
Monte Carlo algorithm that we use to carry out the cal- 
culus of Bayesian inference. In section [V] we introduce 
our noise models and describe the simulated data sets. 
Bayesian model selection is then applied to identify the 
most effective noise model. Section IVT1 defines the black 
hole inspiral waveforms that we inject, and section IVIII 



describes the search phase of our analysis. Results are 
presented in section [Villi describing a simultaneous char- 
acterization of gravitational wave signals and instrument 
noise for two simulated data sets. We close with a dis- 
cussion of our plans for future work in section IIXI 



II. LIKELIHOOD MODELS 

For Gaussian noise there is a well known optimal solu- 
tion to the detection problem based on Wiener matched 
filtering , which can be described in terms of a detec- 
tion statistic p - the matched filter signal-to-noise ratio, 
or equivalently, the log likelihood log£ = — x 2 /2. Sup- 
pose the data s produced by a gravitational wave detec- 
tor were the superposition of instrument noise n and the 
detector's response to a passing gravitational wave h s . 
Then the signal-to-noise for a template h is defined: 



p(h) 



(gift) 

(h\hy/ 2 



and the chi-squared residual by 

X 2 (h) = (s-h\s-h). 



(1) 



(2) 



The expectation value of p is maximized, and x 2 is mini- 
mized, when the template matches the signal, h = h s . 
Here we have used the standard noise weighted inner 
product for Gaussian noise with one-sided noise spectral 
density S n {f): 



(a\b) = 2 



ab* 



Sn(f) 



df. 



(3) 



The gravitational wave detection problem is usually 
described in the frequcntist language of false alarms 
and false dismissals based on Monte Carlo studies of 
a suitably chosen detection statistic. According to the 
Neyman-Pearson theorem, the quantity p provides an op- 
timal statistic for stationary, Gaussian noise as it yields 
the maximum detection probability for a given false 
alarm probability. While there are no similar proofs of 
optimality for the p statistic when applied to the non- 
stationary and non-Gaussian noise encountered in the 
LIGO/ Virgo detectors, the matched filter SNR or closely 
related quantities are adopted as search statistics that are 
then calibrated using Monte Carlo studies of signal injec- 
tions and scrambled data. Sections of data that are iden- 
tified as being corrupted by excessive noise or transient 
instrumental artifacts (glitches) are vetoed prior to the 
tuning of the search statistic. The development of these 
vetoes is a complex topic, but the basic idea is to either 
look for correlations between glitches in the gravitational 
wave channel and the thousands of diagnostic channels, 
or to look for statistical patterns. A more detailed de- 
scription of the vetoing procedures and data quality as- 
sessment can be found in Ref. [3] . 

An alternative to tuning a statistic chosen for its per- 
formance with Gaussian noise is to look for new statistics 
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that can be shown to be near optimal for relevant forms 
of non- Gaussian noise 0, [13] • This approach is similar in 
spirit to what we are proposing. 

It is interesting to contrast the frequentist and 
Bayesian approaches to the detection problem. In the 
Bayesian approach we simply compute the posterior dis- 
tribution function p(h\s, Ad) for the waveforms h of 
model hypothesis Ad , and the associated model evidence 
p(s\Ad). These are related by Bayes' theorem to the like- 
lihood p(s\h, Ad) and the prior p(h\Ad) by 



p(h\s,M) = 



p(h\M)p(s\h,M) 
p(s\M) 



and 



p(s\M)= p(h\M)p(s\h,M)dh. 



(4) 



(5) 



Everything we want to know about the waveform model 
Ad is contained in the posterior distribution, while the 
evidence allows comparisons to be made between alter- 
native models (e.g. is a GW signal present or not). It 
is important to emphasize that the Bayesian approach 
is entirely mechanical, there are no statistics to tune or 
thresholds to set - you simply go ahead and calculate con- 
fidence intervals for the waveform parameters and odds 
ratios for competing hypotheses. On the other hand, the 
output of this mechanical process is only as good as the 
inputs - if the waveform model or the likelihood func- 
tion is flawed, then the conclusion will also be flawed. 
We will assume that the waveform model is accurate (see 
Ref. [IH for a discussion of the biases introduced by inac- 
curate waveform models), and concentrate our attention 
on the likelihood function. 

The data collected by one or more gravitational wave 
detectors can be represented by a discrete collection of 
samples Si. These may represent the strain sampled at 
certain times, Fourier amplitudes, wavelet amplitudes 
etc. In the absence of a gravitational wave signal, Si = rii, 
and we are looking at samples drawn from the instrument 
noise distribution pn^i)- When a signal hi is present 
we have n.; = Si — hi, and assuming that the noise and 
the signal are uncorrelated, the likelihood of observing Si 
given the waveform hi is simply [2Cj 



p(si\hi, Ad) = pn(s4 - hi) , 



(G) 



while the likelihood of observing the full data set 
{si, S2, . . . , sn} is the joint probability 

p(s\h, M) = pn(si - hi,s 2 - h 2 ,---,s N - h N ) . (7) 

If the noise in each sample is independent, then the joint 
distribution is the product of the individual distributions: 



/v 



j(s\h,M) = Y[pn(s x - hi) . 



(8) 



The key point is that the likelihood function is nothing 
other than the probability distribution that describes the 



instrument noise. And therein lies the problem: The de- 
scription of the likelihood function, and hence the verac- 
ity of the Bayesian approach, is only as good as our un- 
derstanding of the instrument noise. A solution to this 
problem is to introduce a model for the noise Q, and 
to use the data to jointly estimate the noise and signal 
parameters. We have taken this approach to a limited 
extent in previous analyses where we treated the noise as 
Gaussian, but with its variance to be determined from 
the data [2lM23l | . Here we extend the scope of the noise 
modeling to consider non-Gaussian tails and inter-sample 
noise correlations of the type caused by semi-coherent 
"glitches" . 
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FIG. 1: A time-frequency scalcogram showing a loud 
glitch in the output of one of the LIGO detectors 
(LIGO document number LIGO-G070807-00-Z). 



Recall that for white Gaussian noise the samples rij 
will be independent in time and in frequency, while for 
colored Gaussian noise the samples will be independent 
in frequency and correlated in time. To avoid the com- 
plications of having to account for these correlations 
with a joint probability distribution, it is standard prac- 
tice to compute the likelihood in the frequency domain, 
where the likelihood factors into a simple product po| . 
However, as is clear from Figure [TJ the glitches seen in 
the LIGO/ Virgo data are semi-coherent in time and fre- 
quency - that is, if there is excess noise in one time fre- 
quency pixel, then there is a high probability that there 
will be excess noise in a neighboring time- frequency pixel. 
These glitches not only produce large non-Gaussian tails, 
they also introduce strong correlations in the noise model. 
While there are many ways to try to model this behav- 
ior, we will show that treating the glitches as coherent 
instrumental artifacts provides an effective solution. 
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III. DATA ANALYSIS IN THE WAVELET 
DOMAIN 

The first step towards constructing a realistic noise 
model is to abandon the familiar Fourier-domain ap- 
proach to signal processing in favor of a timc-frcqucncy 
decomposition in the wavelet domain. 

Time-frequency methods are employed in a variety of 
GW search algorithms for LIGO/ Virgo data, particu- 
larly in searches for gravitational wave "bursts" - tran- 
sient gravitational wave signals for which no templates 
are available - such as the coherent Wave Burst (cWB) 
algorithm [24|. 

Individual wavelet functions ipij are compact in both 
frequency and time, and can be used to form an orthog- 
onal basis [25j]. Each clement of the basis is a scaled, 
time-shifted version of the "mother wavelet" ip of which 
a large variety exist. We have chosen the Meyer wavelet 
to form our basis functions as is used in cWB. Fig. [2] 
depicts an example of the Meyer wavelet in the time do- 
main, as well as its power spectrum. The Fourier power 
of a single wavelet function is perhaps most illuminating. 
Each wavelet basis function acts as a bandpass filter, se- 
lecting for a specific range of frequencies, during a specific 
time interval of the function being decomposed. 

We use the discrete wavelet transform (DWT) of a 
time scries s(t) to determine the wavelet coefficients Wij. 
Each Wij represents the amplitude of a pixel in the time- 
frequency plane with volume At Af = 1/2. The indices i 
and j correspond to the frequency and time, respectively, 
of the wavelet coefficient. 

Unique to the DWT is the way the pixels "tile" the 
time-frequency plane: Low frequency wavelets have long 
durations and narrow frequency response. As we move to 
higher frequency the bandwidth of the wavelet increases 
while the duration shortens (maintaining AfAt = 1/2). 
A cartoon representation of this timc-frcqucncy tiling can 
be found in Fig. [3] Each frequency layer, denoted by 
the index i, contains 2 l divisions in time and, for signal 
duration of T, is 2 i ~ 1 /T wide in frequency. 

For stationary, Gaussian noise, the wavelet coefficients 
are sufficiently uncorrelated to treat the wavelet-domain 
noise-correlation matrix as diagonal. In this instance the 
expectation value of the noise power for the wavelet pixel 
with indices i and j, Sij, can be computed by filtering 
the noise power spectral density S n (f) with P$(f) = 

l^(/)| 2 : 

POO 

Sy= / P%(f)Sn(f)df. (9) 
Jo 

The power spectrum for any wavelet in layer i, as well 
as the power spectral density at times j are the same for 
stationary, Gaussian noise. Including the time index will 
prove useful when we allow for noise varying with time. 
We are now able to define a diagonal noisc-wcightcd 



inner product as 

k i,j V 

where the index k represents an individual interferome- 
ter, or IFO, in the network. For Gaussian noise, the inner 
products computed in the wavelet domain using (|10[) are 
almost identical to the the inner products computed in 
the Fourier domain using (|3|). 



IV. MARKOV CHAIN MONTE CARLO 

Baycsian methods have lagged behind their frequcntist 
brethren because of the burden associated with evaluat- 
ing Eqs. (Ql and ((SJ. Recently, powerful techniques for 
overcoming these computational hurdles (e.g., Markov 
Chain Monte Carlo (MCMC) methods Hi, [27(, Nested 
Sampling [28| . etc.) have become progressively more ef- 
ficient and computer power has increased, allowing us 
to employ these resources on interesting GW detection 
problems [Hill. 

Our method of choice for computing the posterior dis- 
tribution function are the MCMC algorithms. These 
techniques have shown prowess in solving parameter es- 
timation problems, while variants of these methods have 
also proven effective as search algorithms [3p - [38l ]. Here 
we briefly describe the basis of the algorithms, and leave 
specific details of our implementation to later sections. 

MCMCs provide samples from the (previously un- 
known) target posterior distribution function for the pa- 
rameters of some model Ai . This is accomplished by first 
adopting a position in parameter space 9 X as the first 
"link" in the chain and evaluating that position's like- 
lihood p(s\9 x ,M.) and prior probability p(6 x \M). Next, 
we suggest a trial position, 9 y , from the proposal distribu- 
tion q(6 v \6 x ) read "the probability of suggesting a move 
to 9 y given that the current location is 6 X ", The new 

likelihood and prior probability are evaluated and 9 y is 
adopted as the next link in the chain with probability 
k = min[l, H] where H is the Hastings ratio 

^ ^ p(s\0v,M)p(0y\M)g(0 x \0y) 
e * p(s\e x , M )p{6 x \M)q{6 y \e x )' 

This process of stochastically stepping through parame- 
ter space repeats until some convergence criteria are sat- 
isfied (See Ref. [23| for a description of the convergence 
tests we use). Afterwards, the number of iterations spent 
in a particular region of parameter space, normalized by 
the total number of steps in the chain, yields the proba- 
bility that the model parameters have values within that 
region. 

The Hastings ratio is derived by mandating transitions 
from 9 X to 9 y satisfy the detailed balance condition which 



(a) time 



(b) frequency 



FIG. 2: The Meyer wavelet basis function for frequency layer i = 9 and signal duration of 16 seconds, a) Time 
domain ipij(t) f° r arbitrary time index j. b) Fourier power spectrum of ipij(f)- Notice how, for i — 9 and T = 16 s, 
this wavelet acts as a bandpass filter for frequencies / ~ [32, 64] Hz. 



FIG. 3: A cartoon depicting the tiling of the 
time-frequency plane by a discrete wavelet transform 
(DWT). Each pixel has time-frequency volume 
AiA/ = 1/2. 



ensures the samples from the Markov chain are represen- 
tative of the target PDF. 

The choice of q(6 y \9 x ), by construction, can not alter 
the recovered posterior distribution function. The pro- 
posal distribution does, however, dramatically affect the 
acceptance rate of trial locations in parameter space and, 
therefore, the number of iterations required to satisfac- 
torily sample the joint PDF. 

Markov chains are still prone to being "trapped" by 
local maxima of the target distribution for longer than 



the user would be willing to wait. Some enhancement 
to the above prescription is often mandatory in order to 
ensure global sam pling of the PDF. One such method, 
parallel tempering [39( , uses a set of chains running si- 
multaneously, each at a higher "temperature." The like- 
lihood of a chain with inverse temperature (3 = 1/T is 
calculated by p(s\6, A4)@ . Increasing the temperature 
effectively smoothes the topography of the distribution 
being explored. High temperature (low /?) suppresses 
the differences in likelihood between the current and pro- 
posed model parameters, allowing the high temperature 
chains free exploration of the parameter space. In the 
limit where the temperature goes to infinity (/3 goes to 
zero) the chain will sample the prior distributions. 

Parallel tempering promotes adequate mixing by al- 
lowing the chains of different temperature to exchange 
parameters, thus allowing solutions found by the hot 
chains to be communicated to the colder chains. This 
sharing of solutions can be executed while maintaining 
detailed balance if exchanges are accepted using 



H 



P (s|e i ,M,ft)p(s|fl 3 -,.M J -,ft) 

p(*|$,M,A)p(*&,A4i,&) 



(12) 



as the Hasting's ratio in the transition probability for 
an exchange between chain i and j. Only points in the 
(3 = 1 chain sample the target distribution and are there- 
fore permitted to contribute to the chain from which the 
PDF is inferred. However, by exchanging parameters 
with hotter chains the j3 = 1 samples rapidly explore 
the full target distribution, including movement between 
different modes of the posterior which, for practical ap- 
plications, can be impossible for a single chain to achieve. 
While only the f3 = 1 chain contributes to the PDF, the 
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rest of the chains serve as more than just a convergence 
aid. The average log-likelihood of each chain, integrated 
over P = [0, 1], is equivalent to the integral in ([5]) [ioj ]. 
allowing us to perform parameter estimation and model 
selection studies using a single analysis "pipeline." 

It should be noted that simultaneously solving the pa- 
rameter estimation and model selection problem is not a 
unique attribute of the parallel tempered Markov chain 
Monte Carlo (PTMCMC) algorithm. On the contrary, 
the MultiNest algorithm [4l|, |42| also serves as an "all-in- 
one" Bayesian analysis package that has been successfully 
employed in GW data analysis problems. While we will 
exclusively use the PTMCMC approach in this paper we 
want to emphasize that these algorithms are merely tools 
used to mechanically perform the calculation of (|4]) and 
([5]) , and while some tools may be better suited for a par- 
ticular problem than others, the conclusions drawn from 
the data should be identical. 



V. MODELING THE NOISE 

We now return to our goal of finding the best noise 
model (equivalcntly the best likelihood function) to be 
used when describing realistic GW data. We will con- 
struct three different likelihood functions and compare 
their performance on simulated non-Gaussian noise us- 
ing Bayesian model selection. What follows is a brief 
summary of each noise model before a more detailed de- 
scription and comparison: 

• Model Nq: Allow the noise level to vary as a func- 
tion of both time and frequency. 

• Model N\: Use a likelihood function derived from 
a distribution which has non-Gaussian tails. 

• Model G\\ In conjunction with No or N\, model 
the noise as two independent contributions - a 
stochastic component drawn from Ni plus coher- 
ent "glitches." 



A. Nq: Fitting to the noise level 

The first order solution is to fit to the underlying "DC" 
noise level in the data. This has been done in frequency 
domain analyses [2l| - [23| allowing the model enough flexi- 
bility to account for errors in the predicted S n {f). In the 
wavelet basis this approach to noise modeling gains sensi- 
tivity to non-stationarities because temporal information 
is encoded in the noise "spectrum." 

For this model, parameterized by ff, we use the ex- 
pected noise level in "blocks" of wavelet pixels contain- 
ing a certain time-frequency volume (TFV) normalized 
by the theoretical noise level, 



with a single value of r\ assigned to each TFV block: 
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ij, measured 
ok 

ij, theoretical 



(13) 



%0 — %30+l 



Virfo+TFV-l 



(14) 



where jo is the time index of the first wavelet coefficient 
in the block under consideration. Because the noise level 
is allowed to vary, the normalization of the likelihood 
and the noise weighted inner product in the chi-squarcd 
depend on the noise parameters: 



In p(s\ff, N ) = -- [ (r\r 



IFO 



EE 1 - 



(15) 



k i,j 



with rjjjSjj substituted for S*. in the inner product de- 
fined by (jlU)) . The second term in (|T5"j) comes from the 
normalization of the likelihood. Note that for a noise- 
only model, the residual r is simply the data s. 

If we fit to the noise level in large time- frequency blocks 
we lose sensitivity to short-duration non-stationarities, 
while using small time-frequency blocks results in a 
model with a large number of parameters, all of which 
must be constrained. In a model selection sense, the 
former will produce poor fits to the data (if non- 
stationarities are present) while the latter will carry a 
large "Occam Factor" penalty for the model's additional 
degrees of freedom. Some tuning is required to choose 
the correct TFV for the noise blocks in order to optimize 
this approach. While No is a non-stationary noise model, 
it is unable to respond to short duration impulses of noise 
unless it is equipped with an unreasonable number of free 
parameters. 



B. Nn Non-Gaussian Tails 

This noise model, denoted by N±, will continue to fit 
to the noise level in blocks of wavelet pixels as in TVn, 
while additionally employing the suggestion in Ref. Q 
to redefine the likelihood as a weighted sum of two nor- 
mal distributions. The majority of the weight will be 
allocated to the distribution with the same variance as 
a Gaussian model so that the bulk of noise samples are 
presumably "drawn from" the standard picture of the 
instrument noise. However, some non-zero contribution 
to the likelihood comes from a significantly wider distri- 
bution. This broadens the tails of the noise distribution 
without greatly altering it's core. In other words, the ma- 
jority of samples drawn from this distribution will "look" 
as though they come from the ideal noise distribution, 
however the frequency of "outliers" is greatly increased. 

Symbolically, the probability of measuring m in a sin- 
gle bin of the data can be calculated as 



P(rn) 



1 



'2-k \ cr i 



+ 



1 — a 
( 

CT 2 



2 In** 



(16) 



To demonstrate the effect of this distribution, suppose 
we choose 02 = 3<ri and a = 0.99. Then, for a single 
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noise sample, 



and 



p(n = 0\Ni) 
p(n = 0\N ) 



p{n = 10cti|7Vi) 
p(n = IOcti | JV ) 



0.993 



1.970 x 10 18 . 



(17) 



(18) 



We see that for samples near the mean, the difference be- 
tween the two distributions is negligible, while there is a 
large increase in the probability of large noise excursions 
under N\. 

Following the discussion deriving Eq. <JSj> , the likeli- 
hood of measuring the residual r in the wavelet domain 
for N\ becomes 



p(s\fJ,N 1 )=l Jp(4). 



(19) 



Using a superposition of two normal distributions is a 
standard method in Bayesian modeling of non-gaussian 
data [43l ] but has not previously been implemented in 
gravitational wave signal processing. 



C. Gi: Fitting for Glitches 

While both A^o and N\ possess many of the qualities 
which we desire in an accurate noise model, both assume 
that each noise sample in the data is independent (in 
the wavelet domain). Instrument glitches, on the other 
hand, are coherent in time-frequency space. A wavelet 
scaleogram representation, depicted in Figure [TJ is a use- 
ful visualization of this feature, as the glitch can be seen 
"lighting up" a large cluster of pixels, while the remaining 
data samples are well described as being independently, 
randomly distributed. 

We require some way of modeling the individual 
glitches, and a template bank of such events would un- 
doubtedly be inefficient and incomplete. Instead we can 
once more make use of the wavelet basis, and model 
the glitches as linear combinations of the basis functions. 
Studies of LIGO/Virgo glitches using atoms (e.g., sine- 
Gaussians) have shown that typical events can be decom- 
posed with ~ 10 basis functions [IUdll so it is reasonable 
to anticipate a similarly small number of wavelets will be 
sufficient. The parameters for the glitch fitting are the 
number of wavelets included in the fit, no, the indices 
identifying where in time-frequency space those wavelets 
(referred to as "hot pixels") are located, and their am- 
plitudes 7. 

It is unknown a priori how many, or where in time- 
frequency space, these hot pixels will be needed. Given 
this uncertainty we must automatically adjust the num- 
ber and location of wavelet coefficients in the glitch fit- 
ting, a task best accomplished by the Reversible Jump 
Markov Chain Monte Carlo (RJMCMC) approach. 



This breed of MCMC is able to transition between models 
of differing dimension while satisfying detailed balance if 
the probability of a trans-dimensional move is computed 
using 



H 



p{s\6i,Mi)P0i, MM0jMi) 



(20) 



as the Hasting's ratio in the transition probability. The 
Jacobian | J^- 1 accounts for any change in dimension be- 
tween models Aii and M j . The number of samples spent 
in each model, normalized by the total number of sam- 
ples in the chain, is the posterior probability, or evidence, 
for that model (assuming adequate mixing, convergence, 
etc.). 

In our glitch- fitting RJMCMC, trans-dimensional 
moves propose to either remove a single pixel from the 
glitch model (setting a wavelet coefficient to zero), or in- 
clude a new pixel which was previously not a part of the 
glitch model (selecting a wavelet coefficient which is zero, 
and assigning it some amplitude). 

Because the number and location of hot pixels is vari- 
able, model Gi is more aptly described as a "meta- 
model," while a particular number and configuration of 
hot pixels represents a discrete model for the noise. We 
can denote an individual glitch model by it's number of 
hot pixels, riQ. For model uq and data containing N 
samples, there are il N choose hq" — K n unique combi- 
nations of hot pixels. We want each configuration to have 
equal prior probability, so the joint prior probability for 
each noise model configuration is 



Pindd) = l/K, 



n G \{N-n G )\ 
N\ 



(21) 



Meanwhile, we choose as the prior on the pixel ampli- 
tudes p^ijlna, Gi) = N[0, 100Sy]. 

RJMCMCs are often difficult to work with because of 
inefficient mixing between dimensions. The key to over- 
coming this obstruction is to construct trans-dimensional 
proposal distributions which closely match the target dis- 
tribution. We do so by taking advantage of a glitch's 
tendency to excite a cluster of pixels in time-frequency 
space. The trans-dimensional proposal distributions thus 
favor the inclusion of new pixels in the glitch model which 
are adjacent to the already existing clusters. We do so by 
assigning each unoccupied pixel a probability of being se- 
lected for inclusion in the glitch model which is weighted 
by the number of neighboring pixels already "lit up." 
Meanwhile, the trans-dimensional proposal distribution 
for the glitch amplitudes is identical to that of the prior 
on the amplitudes, thus canceling in the Hasting's ratio. 



D. Simulating Non-Gaussian Noise 

To test our approaches to noise modeling, before ap- 
plying these techniques to data collected by the LIGO- 
Virgo network, we will analyze simulated data containing 
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Gaussian noise with injected glitches. We use a variety 
of glitch injections to ensure that any success is not a 
result of tuning the methods to a particular noise real- 
ization. We begin by generating Fourier-domain noise for 
a 16 second data segment using the initial LIGO/ Virgo 
design sensitivities. We can inject glitches of two differ- 
ent type into the data. One which we will refer to as the 
"gaussian-gaussian" type, is a burst of gaussian noise en- 
veloped by a gaussian profile in time. The second are the 
glitch atoms described in Ref. [lU . 

Glitches are either placed in the data deliberately, or 
randomly chosen such that the glitches are distributed 
uniformly in the time-frequency plane, with average du- 
rations of 10 "cycles." Each generated glitch go, is 
rescaled to achieve a desired "SNR" = (g\g). The 
glitch SNR, x, is either chosen by hand, or drawn from a 
double-power law distribution 

p(x) = C (V 4 + x~ 5/ V 3/2 ) , x > x (22) 

calibrated to the Black Hole ring-down trigger rate in a 
segment of "typical" LIGO S5 data. G is a normalization 
constant and xi ~ 10. The total number of injections 
was also determined by the ring-down trigger rates. This 
distribution produces glitches which are predominantly 
buried by the Gaussian component of the noise, with only 
a rare (~ 1 in 5) noise realization containing an event 
visually above the Gaussian noise. By injecting glitches 
from this distribution we are ruining the stationarity and 
Gaussianity of the noise without using our models for the 
noise to produce the samples. 

E. Comparison of Noise Models 

To compare the relative performance of our new ap- 
proaches to noise modeling, we analyzed three differ- 
ent signal realizations, s\\ Gaussian noise; s^'. Gaussian 
noise plus a single, loud (SNR ~ 100) glitch; and S3: 
Gaussian noise plus 300 glitches distributed across the 
network with SNRs drawn from (f2"2"| . We use the same 
Gaussian noise realization in each simulation. 

Each data set was analyzed using the four different 
combinations of noise models described above: [No, Go], 
[TVi, G ], [JV 0) Gi], and [N 1: Gi], for noise blocks of differ- 
ent time frequency volume (TFV). Histograms showing 
the evidence for each model as a function of log 2 (TFV), 
as well as the posterior for the hot-pixel number no for 
S3, are displayed in Figure 

We will first address the dependence of the model evi- 
dence on TFV. For si and S3 the data were dominated by 
Gaussian noise, while S2 contained a significantly "loud" , 
short duration, glitch. Intuitively, noise models with 
fewer degrees of freedom are preferable when the noise 
level from block to block hardly strays from the theoreti- 
cal prediction (si, and to some extent, S3), while models 
with more flexibility should be favored when a large de- 
viation from Gaussian noise exists. This presumption is 



supported by the results as we see the evidence improves 
with increasing TFV (decreasing number of noise param- 
eters) for si and S3, and diminishes (considering models 
without glitch fitting (Go)) for S2. 

This study has also allowed us to make some con- 
clusions about selecting between these different models 
for non-Gaussian noise. We anticipated that the favored 
model would most qualitatively resemble the noise sim- 
ulation, to wit, [No, Go] for si, [No, Gi] for S2, and [Ni, 
Go] for S3. This prediction, however, is not entirely sup- 
ported by the results. 

For si we see that [No, Go] is on either equal footing 
(for small TFV), or mildly disfavored (at large TFV), 
when competing with [Ni, Go] and [No, G\]. While 
[TVo, Go] is the noise model which most faithfully repre- 
sents how the noise for si was generated, it's inflexibility 
when compared to [No, G\] and [N\, Go] makes it less 
well suited to cope with a particular noise realization. It 
should be pointed out, however, that the differences in 
evidence in this example are not significant in a model- 
selection sense. 

For S2 we see a strong preference for models which in- 
clude glitch fitting (Gi). This is to be expected when 
the data contain a large amplitude noise impulse, while 
the Go models are more adept at describing instrument 
noise with more frequent, lower amplitude, and uncorre- 
lated noise excursions. 

The data simulation which is arguably the most rele- 
vant to our ambition of including this work in the analysis 
of current gravitational wave data would be S3 . Here we 
have a large number of glitches (300) injected into the 
data with SNRs drawn from a distribution calibrated to 
match the frequency of black hole ring-down triggers in 
S5 data. Our motivation for this realization is to produce 
a signal that is most like the actual interferometer data. 
The resulting noise "looks" predominantly Gaussian, as 
the vast majority of the glitches are well below the nor- 
mal instrument noise level. However, when the data are 
analyzed with the noise models we see a generic disfa- 
voring of the Gaussian model [No, Go]- Furthermore, we 
are pleasantly surprised to see that [No, G\] performs as 
well as [Ni, Go] in this example even though [N\, Go] is 
qualitatively best suited for this noise simulation. 

The final point of interest regarding the performance of 
the different noise models is the generally underwhelming 
performance of [N\, G{]. The expectation going in to the 
study was that it would outperform its counterparts on 
S3 , as it was best suited to handle both the rare impulsive 
events, and the overall non-Gaussian component from the 
superposition of many small-amplitude glitches. To ex- 
plain why it generally underpcrforms when compared to 
the other two non-Gaussian models we refer to Fig. 0J1. 
The plot shows the posterior distribution functions for 
the number of hot pixels tlq used by the glitch fitting. 
The Gaussian likelihood model, [No, G\] (red), has a 
peak in the posterior at uq = 4, while [Ni, Gi]'s peak 
(blue) is closer to zero, at uq = 1. If the noise model uses 
a Gaussian distribution for the likelihood some number of 



9 




log;, TFV 
(c) S3 



No-go I 
N,.G„ 




(d) p(m\Ni,Gi) for s 3 



FIG. 4: Plots (a)-(c) show the log evidence for different noise models applied to signal realizations (si, sa, S3) as a 
function of the TFV used by Nq and Ni. Models with discernibly higher evidence are preferred. Plot (d) shows the 
distribution of the "hot-pixel" number used by G\ for the signal realization S3. Our interpretation of these results is 

discussed in the text. 



hot pixels are required to achieve a sufficiently Gaussian 
residual, and G\ has the flexibility to do so. However, 
if the likelihood accounts for these non-Gaussian excur- 
sions by having more weight at large a, fewer hot pixels 
are needed. N% is, by itself, an accurate representation 
of the noise and adding G\ to the modeling introduces 
additional degrees of freedom without a substantial ben- 
efit to the overall fit, making it a less attractive choice in 
a model selection sense. 

Another unanticipated result can be seen in Fig. @Jl. 
We generically sec that the posterior distribution on no 
is always peaked away from zero. This can be understood 
by studying the Hasting's ratio for transitions from uq = 
to tiq = 1. Under such transitions, 

_ p{s\l)p{i l3 \l)p{ ll0 \l)g(i, 3 \0) g( 7lJ |0) 

P (s\0) P (i,j\0) P ( 7ij \0)q(i,j\l)q( 7ij \l) 1 ) 

with 

l(h J7 1 0) = 1 : there is only one pixel to remove in 



a transition from 1 — > 0. 
q(i,j\l) = 1/iV : there are N equally likely pixels 
to include in a transition — > 1. 
j\0) = 1 : there is only one configuration of 
pixels if none of them are "hot." 
p(i,j\l) = 1/N : there are N configurations for 
a single "hot" pixel. 
Pilijl 711 ) = ?(7iil m ) : by construction. 

and Hq^i reduces to the likelihood ratio. Therefore, as 
long as the inclusion of a single hot pixel does not in- 
crease the residual, J?(s|l) > p(s\0), -f/o->i > lj and the 
transition to uq = 1 is accepted. 

The results of these noise model comparisons lead us to 
the following conclusion: While more than one model is 
preferred under different circumstances (noisc/glitch re- 
alizations), the model which uses a Gaussian likelihood 
along with glitch fitting was never seen to be disfavored 
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in comparison to the other models under consideration. 
Given this generic utility, we will adopt [No, G\\ as our 
default noise model while turning our attention to the 
model selection challenge in which we are more funda- 
mentally interested: Distinguishing between a gravita- 
tional wave signal and non-Gaussian noise. 



VI. MODELING THE GRAVITATIONAL 
WAVES 

We take as our test signal the inspiral of a binary sys- 
tem composed of two stellar-mass black holes. The full 
gravitational waveform from the inspiral of binary black 
holes is composed of an inspiral, merger, and ringdown 
phase. Analytic solutions exist for the inspiral and ring- 
down waveforms courtesy of post-Newtonian and black 
hole perturbation theory, respectively Recent advances 
in Numerical Relativity have led to the simulation of 
black-hole mergers, allowing (for particular combinations 
of mass and spin) the construction of full waveforms. An- 
alytic models of the full waveforms, calibrated against 
the numerical simulations, provide accurate waveforms 
which cover the entire signal (see e.g [H, E3]). While a 
true analysis pipeline should utilize all of the available 
signal, for this proof-of-principlc effort we will only simu- 
late the inspiral phase of waveform both for our injections 
as well as our analysis. 

While neglecting the post-inspiral part of the wave- 
forms, we apply additional simplifications by ignoring 
spin effects, as well as any orbital eccentricity. What 
remains is a model of the waveforms which is charac- 
terized by nine quantities written as components of the 
signal-model parameter vector: 

(m,M,t c ,logD L ,sm5,a,cos6 L ,<l) L , ip c ) (24) 

where m and M. are the total and chirp mass of the bi- 
nary system, t c is the time when the binary coalesces 
(more accurately, when the post-Newtonian approxima- 
tion to the frequency diverges), Dl is the luminosity dis- 
tance to the system, 5 and a are the declination and right 
ascension as defined on the celestial sphere, 6l and <j>L are 
the polar coordinates of the angular momentum vector 
for the binary, and ip c is the GW phase at coalescence. 
We will further simplify the waveform by only consider- 
ing the quadrupolc radiation, ignoring higher harmonics. 
For parameter estimation studies which incorporate the 
details that we have ignored see, for example, 0, \I§t - 



VII. THE DETECTION ALGORITHM 

Armed with models for the noise and the signals, we 
now wish to test their relative distinguishability on simu- 
lated data using the PTMCMC detection algorithm. We 



will consider four models: 

No,Go,B :s(t,f) = n(t,f), 
iVo,Go,5i:s(i,/) = n(t, f) + h(t, f), 
JV ,Gi,S :s(t,f) = n(t,f)+g(t,f), 
N , G lf fl x : s(t, f) = n(t, f) + g(t, /) + h(t, /), (25) 

and for each, thoroughly resolve the PDF of the model 
parameters and calculate the evidence. We will consider 
the prior odds between the models to be unity so that 
differences in model evidence are equal to differences in 
the posterior probability for each model. Henceforth, we 
will omit writing No as part of the model identification 
as it is common in all scenario's considered here. 

The detection algorithm is broadly divided into three 
phases. For each model under consideration the algo- 
rithm performs a: 

• Search: To locate the regions of high probability in 
parameter space. 

• Characterization: To globally sample the posterior 
distribution for the model parameters. 

• Evaluation: To calculate model evidence and de- 
termine which (if any) of the models are favored. 

The stages of the algorithm arc linked automati- 
cally, making this an "end-to-end" analysis pipeline for 
matched-filtering searches. Generally, the search phase 
sacrifices detailed balance in favor of rapid convergence, 
and uses a modest number of parallel chains (~ 10). In- 
formation from the search is used to construct informed 
proposal distributions for the characterization, where we 
must take care that the chains are Markovian. During 
the characterization phase we employ a stronger dose of 
parallel tempering (~ 30 chains) to thoroughly sample 
the evidence integrand. When using thermodynamic in- 
tegration of the evidence, the evaluation phase occurs 
in post-processing as all of the necessary information is 
acquired during the characterization phase. 

The above description of the detection algorithm is a 
very general outline of the procedure. What follows is a 
more detailed look into the implementation of each step 
for this study. 

A. The Glitch Search 

We first analyze the data from each detector individu- 
ally with the glitch-only model (G\, Bo), to establish the 
range of hot pixels over which the model posterior has 
significant support. The noise in each detector is inde- 
pendent, as is the noise model (including glitch fitting). 
Therefore, we need not perform this phase of the analysis 
simultaneously across the network. 

While the [Gi, Bo] chains are running in their post- 
burn-in stage, the state of the glitch model (i.e., the 
number, location, and amplitude of non-zero wavelet co- 
efficients used in the glitch fitting) is periodically stored 
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so that we retain a population of samples from the glitch 
search chains. 

This "glitch cleaning" approach would not work if the 
data contained un-modeled bursts of gravitational waves 
as the wavelets efficiently fit the astrophysical signal (as 
is demonstrated in the wavelet-domain LSC burst search 
algorithms). We can perform the analysis in stages here 
because the power in a single bin of data from a black hole 
binary signal is swamped by the instrument noise, and 
it is only with matched filtering that we can effectively 
elevate these signals above the noise. The glitch model is 
therefore "blind" to the BH waveform in the data, and is 
left to clean the noise down to some Gaussian residual. In 
§ 11X1 we describe an improved approach that combines the 
glitch fitting in each detector with a search for coherent 
power across the detector network. 



B. The Gravitational Wave Search 

The goal of the GW search phase is to locate the modes 
of the posterior distribution function. Straight forward 
implementations of MCMCs converge slowly - impracti- 
cally so when exploring large spaces of high dimensional- 
ity. Several methods exist for optimizing a search using 
the Metropolis-Hastings algorithm as the driver for the 
exploration [31rl38l ]. The most effective are those which 
either violate detailed balance and/or sample from a bi- 
ased target distribution (e.g. simulated annealing, F- 
statistic extremization, "mode- hopping" proposals, etc.). 
There do exist enhancements which preserve the inde- 
pendence of the chain samples, such as delayed rejection, 
but we have found the implementation of such methods 
unnecessarily complicated. 

Our approach is to use "illegal" search techniques to 
locate the regions of high posterior weight and to use the 
biased PDF from this search as one of the proposal dis- 
tributions used to sample from the target distribution. 
Specifically, we maximize the likelihood over a subset of 
the extrinsic parameters, {t c , tp c , Dl} using the correla- 
tion of the template with the data. This maximization 
must be done in the Fourier domain and assumes gaus- 
sian noise. The search is therefore performed in frequency 
space and in the absence of noise fitting. Consequently, 
the search-phase inner products use the initial LIGO and 
Virgo theoretical noise PSD. 

The disadvantage to this Fourier-domain, biased search 
phase lies in our inability to do any glitch fitting during 
this stage. To prevent the searching chains from spending 
time fitting to glitches, we make use of the stored states 
from the glitch removal phase. A state of the (already 
finished) glitch chain is randomly chosen, and subtracted 
from the data. The subtraction is done in the time- 
domain, and occurs periodically, about once every 1000 
iterations, so the search phase sees the residual from dif- 
ferent realizations of the glitch-only analysis. Performing 
this maneuver while maintaining detailed balance would 
be a challenge but because wc are merely localizing the 



modes of the posterior (and not claiming that wc are ac- 
curately sampling the PDF), we can get away with these 
violations. 



C. Characterization and Evaluation 

To characterize the signal and glitch model, and evalu- 
ate each model's evidence, we repeat the analysis with the 
full implementation of parallel tempering and noise fit- 
ting while satisfying detailed balance and sampling from 
the target distribution. Wc greatly expedite this phase 
of the analysis by including the proposal distribution 
that was constructed from the (biased) samples of the 
GW search phase. Recall that the proposal distribution, 
by construction of the Hastings ratio, can not bias the 
chain's sampling of the target distribution. Even though 
the search-chain samples were obtained "illegally', they 
can be used to help move the chains between modes of 
the target distribution. 

To form this efficient proposal distribution, we sort the 
GW search chain parameters into a joint histogram with 
bin widths (in each parameter direction) equal to some 
multiple of the standard deviation for that parameter as 
estimated by the Fisher Information Matrix. The Fisher 
Information Matrix is evaluated at the search chain's 
Maximum a Posteriori (MAP) parameters. The tech- 
nique of constructing a proposal distribution for a high- 
dimensional space which has been sparsely covere d by 
a Markov chain was described with more detail in [23| . 
Binning the search chains into a proposal distribution can 
most likely be made more efficient by using binary space 
partitioning algorithms - a technique that has previously 
been studied for use in LIGO /Virgo analyses to approxi- 
mate the posterior distribution from a Markov chain [52| . 

Provided the resulting 9-dimcnsional histogram built 
out of the search-chain samples is normalized and allows 
access to the entire prior volume it is a perfectly valid 
proposal distribution from which we can draw new pa- 
rameter combinations (taking care to accurately calcu- 
late the q(x\y) / q(x\y) term in the Hastings ratio). This 
proposal distribution, although biased by the likelihood 
maximization from the search phase, is a sufficiently ac- 
curate approximation to the true PDF to improve the 
mixing of the chains during the characterization phase. 
The use of parallel tempering, coupled with the pilot ex- 
ploration of the posterior, leads to rapid (less than 10 3 it- 
erations) convergence and supplies further assurance that 
the chain is globally sampling the target distribution. 

We further expedite matters by restricting the number 
of hot pixels in the glitch model to those which had sig- 
nificant weight in the search-phase posterior. Effectively, 
we are reducing the number of glitch models under con- 
sideration and neglecting those which would contribute 
negligible weight to the evidence. 

Upon the completion of the characterization chains, 
the average likelihood for each chain is computed, and 
the evidence is calculated using a simple trapezoid inte- 
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gration over the inverse temperature. 



D. Prior distributions 

We use uninformative (flat) priors on all of the signal 
model parameters with angular variables taking the full 
allowed range. We allow for the individual constituents of 
the binary to have masses ranging from 1 M Q to 30 M Q . 
The distance to the binary is constrained between 1CP 1 
and 10 3 Mpc. Because the masses are scale parameters 
we use uniform distributions in the log of the chirp and 
total mass. Each time-frequency block for which a noise 
parameter is assigned contains Ntfv = 1024 pixels. 

For more detail regarding the implementation of the 
MCMC and parallel tempering, refer to the Appendix. 



VIII. RESULTS 

The data are simulated by first creating a noise realiza- 
tion as described in $V] and then injecting a black hole 
waveform with the luminosity distance tuned to a chieve 
the desired network signal-to-noise ratio SNR = y/{h\h). 
The injected SNR is computed using the theoretical, 
Gaussian noise level, so the actual SNRs will be some- 
what lower since additional noise is injected by way of the 
glitches. We will study the performance of the detection 
algorithm on two noise simulations qualitatively similar 
to s2 and s3 from JV]over a three-detector network which 
includes both four kilometer LIGO interferometers plus 
Virgo. 



A. Example 1: Network of Detectors, Single Loud 
Glitch 



For this example, the BH injections had SNRs between 
6 and 12. The glitch injection contained a single glitch 
atom, injected into the simulated Virgo data, with a SNR 
of 100. The time-frequency location of the glitch was 
chosen to overlap with the BH waveform. 

Figure [5] shows the accuracy with which the glitch is 
removed from the data. The top panel displays the full 
16 seconds of Virgo's time-domain data (red) and the 
residual (blue) for a typical state of the chain. The lower 
panel zooms in to the region around the glitch, and in- 
cludes the injected Gaussian noise (black). The RJM- 
CMC glitch fitting procedure was able to parsimoniously 
remove the non-Gaussian component of the noise without 
significantly affecting the rest of the data. 

The distributions of the number of hot pixels (nc) used 
by the RJMCMC glitch model are shown in Figure |U 
The value of ng at the peak of these distributions is the 
glitch model with the highest evidence. The favored mod- 
els for the LIGO Livingston and Hanford detectors (LHO 
and LLO) are for low uq as the simulated noise from 
those instruments was stationary and Gaussian, while 
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FIG. 5: A typical residual from the glitch search phase. 
The top panel shows the full 16 seconds of Virgo data 
while the bottom is a zoomed in view of the glitch. The 
red line is for the raw data s(t), while the blue line is 
the residual r(t) from a randomly selected state of the 
glitch-search chain. Included in the lower panel is the 
injected Gaussian noise n(t) demonstrating the 
accuracy of the glitch removal. 




FIG. 6: PDF of tig for stationary, Gaussian noise with 
a single "loud" glitch injected into the Virgo data. The 
RJMCMC automatically prefers small values of no for 
the Gaussian noise, and ~ 25 glitch wavelets to fit the 
large amplitude noise excursion seen in Figure [5] 



we see clear evidence for a non- Gaussian feature in the 
Virgo data. For the characterization/evaluation studies 
we restrict uq in each interferometer to those in Figure [5] 
which have non-zero probability. For discussion of why 
the LLO and LHO histograms are not peaked at zero see 

m 

The black hole search was run on data with the glitch 
already removed as described in Will For demonstra- 
tive purposes, we also ran a search on the "raw" data 
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FIG. 7: Time-to-coalescence search chains with (blue) 
and without (red) glitch fitting for Example 1. Without 
glitch fitting the search locks onto a glitch in the Virgo 
detector (injection time ~ 4 s) while with glitch fitting 
the search locks onto the signal injection (t c ~ 6 s). The 
GW signal was injected with SNR of 12. The 

prescription for the glitch fitting is described in Will 



still containing the glitches. Figure [Jj shows the time- 
to-coalescence chains for the searches performed on the 
data with the BH injection at SNR of 12. The search 
templates on the raw data had t c values consistent with 
the time of the glitch injection (completely missing the 
BH signal), while the chains exploring the residual data 
found the correct value of the merger time. For the other 
data sets, the injected source parameters were success- 
fully extracted from the residual data for injections above 
signal-to-noise ratios of ~ 8. 

Upon completing the search chains, we bin the samples 
from the BH search into a 9D histogram to use as a pro- 
posal distribution (as described in Q VIII and Ref. (23|) and 
reanalyze the data using the four models outlined in the 
previous section. During this characterization phase the 
noise and signal parameters are updated together, and 
care is taken to satisfy the detailed balance condition. 

Figure [8] shows the posterior distribution functions for 
each of the BH signal parameters, marginalized over all 
other parameters and noise models, for the SNR = 12 
signal injection. Vertical lines denote the injected pa- 
rameter values, all of which lie in the supported region 
of the posterior. The familiar multi-modal structure of 
these posterior distributions is in evidence. 

The evidence for each model is calculated using ther- 
modynamic integration and the results of these calcula- 
tions are shown in Figure |9l The evidence calculation 
is un-normalized, so the value for each model is of no 
consequence. What matters is the relative evidence be- 
tween the different descriptions of the data. For all cases, 
the models with glitch fitting (Gi) dominate over those 
without (Go). For the G\ models, we begin to see sig- 
nificant support for the black- hole model [B\ ) at SNR of 



10. There is never significant support for the noise-only 
model because we set the prior range on £>£ to extend 
well beyond the range of the detectors, i.e., included in 
our GW model are signals which are undetectable. The 
model selection scheme should not distinguish between 
noise only or noise plus a gravitational wave of (near) 
zero amplitude. Had we adopted some minimum SNR 
cutoff on the luminosity distance, we would have seen the 
noise model being favored at low SNR. A similar study 
was performed for galactic binaries in Ref. (23[. 

The question of "detectability" is more clearly demon- 
strated by looking at the Bayes Factor 



Bi,o 



p(8\G u B ) 



(26) 



shown in Figure [TU1 Because of our choice to use uniform 
priors on the different models, B\.o is equivalent to the 
odds ratio for favoring the BH model to the noise-only 
model. The dashed lines represent odds ratios of 3:1 and 
12:1 which are historically taken as different "confidence" 
intervals in Bayesian model selection. For Bayes factors 
between 3:1 and 12:1 model 1 is weakly supported over 
model 0, while above 12:1 the support for model 1 is 
considered strong. We find that by SNR = 10 there is a 
clear preference for the signal model which then increases 
exponentially with the gravitational wave amplitude. 

Perhaps more importantly, we do not see false positive 
detections (i.e., the BH model fitting to the glitch with 
significant evidence) when the BH amplitude is too low 
to be detected. This contrasts with the models that treat 
the noise as Gaussian (Go) where we see strong support 
for the BH model even though the GW chains never lo- 
cated the injected signal (see Figure 



B. Example 2: Network of Detectors, Glitch 
Distribution 



Example 1 is a simplified simulation of non-Gaussian 
noise for the ground based GW detectors. To further 
test the effectiveness of our noise-modeling algorithm we 
simulated new data and re-ran the analysis described in 
Example 1 but with 100 glitches injected from a distri- 
bution calibrated off of the S5 ring-down trigger rate as 
described in S|V] Spectrograms of the glitch injections 
into each detector are shown in Figure [11] and are com- 
pared to the waveform injection which is shown using 
the same color scale. Figure shows the various con- 
tributions to the detector output for the Virgo detector, 
which happened to have the loudest glitch in this noise 
realization (at t ~ 1 second). 

The results for this second example are qualitatively 
similar to those for the first example, thus the discus- 
sion in this section is kept brief. Figures [T31 HU HU for 
Example 2 correspond to Figures 171 l9l ITOl for Example 1. 

Despite the increased complexity of this data simula- 
tion, and the increased challenge of extracting the GW 
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FIG. 8: Marginalized PDFs of A for SNR = 12 from Example 1. The vertical line marks the injected waveform's 
values. Notice the multi-modal structure of the posteriors, illustrating the challenge these signals pose for any 

analysis method. 



signal from the non-Gaussian noise, the results and con- 
clusion made from this example are essentially identical 
to those from the simpler "single glitch" case of Example 
1. The only significant differences between this and the 
previous test are the degree to which glitch fitting im- 
proves the evidence, and to the overall confidence levels 
of detection. 

The change in evidence between Go and G\ models is 
much larger in Example 1, where the injected glitch was 
a much more predominant feature in the data. However, 
the relative difference between evidence with and without 
glitch fitting, for both examples shown here, demonstrate 
overwhelming support for G\ models. 

We attribute the slight differences in the detection con- 
fidences to the GW signal injecti ons, w hich are scaled to 
achieve a desired ideal SNR = y/ (h\h). Because the data 
from Example 2 contains, on average, more power in each 
wavelet layer (due to the additional, irresolvable, low- 
SNR glitches) as compared to Example 1, the effective 
SNR of the glitch injections is lower than the indicated 
by the theoretical SNR that appears along the x-axis. 



The evidence ratios for the models that include glitch 
fitting shown in Figure [T5l should be compared to the ev- 
idence ratios found with no glitch fitting that are shown 
in Figure HU Here the dangers of using a Gaussian like- 
lihood function are thrown into stark relief. A fully con- 
vergent Baycsian calculation of the odds ratio favors the 
signal model even when no signal is injected! It is not 
until the injected signal has SNR above 16 that the re- 
covered parameters actually correspond to what was in- 
jected. In contrast, the glitch fitting model only favors 
the signal model when the injected signal parameters are 
recovered. Veitch and Vecchio [5l| have also observed 
that using a Gaussian noise model to calculate Baycs 
factors on glitchy data can lead to "false positives" , and 
they have argued that this can be accounted for by treat- 
ing the odds ratio as a frequentist statistic, that can be 
tuned using signal injections and time slides of the data. 
While this suggestion is not without merit, it comes at 
the cost of having much higher detection thresholds than 
those demonstrated here when the noise model includes 
glitch fitting. It is worth mentioning that the current 
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FIG. 9: Log evidence for each model under 
consideration for Example 1. The evidence is highest 
for the glitch-fitting models. The BH plus glitch model 
[G\,B\\ begins to distinguish itself from the glitch 
model [Gi,B ] above injected SNR of 8. Had we 
naively assumed a Gaussian noise model, the evidence 
calculation would have erroneously indicated a 
detection for arbitralily weak signal injections (false 
positives in the usual parlance). 



LIGO/ Virgo inspiral search pipeline Q would not be 
fooled into claiming a detection from the glitches injected 
in Examples 1 or 2 (events with significant SNR in just 
one detector do not make it through to the coincidence 
test). On the other hand, introducing the capability to 
model instrument glitches should help lower the thresh- 
olds in these searches. 



IX. DISCUSSION 

The goal of this work was to develop a better likelihood 
function (i.e., noise model) to account for non-Gaussian 
and non-stationary features in LIGO/ Virgo data, and to 
apply this new tool to the gravitational wave detection 
problem. We studied the performance of three noise mod- 
els proposed as an alternative to the standard stationary, 
Gaussian treatment: 

1. Noise samples are drawn from a Gaussian dis- 
tribution with a time-varying expectation value. 
[N , G ] 

2. The noise distribution is modeled as the sum of two 
Gaussians such that the bulk of the samples are 
representative of Gaussian noise but with higher 
probability for "large sigma" events. [Ni,Go] 

3. The noise consists of two contributions - non- 



FIG. 10: Log Bayes factor for models [G\,Bi\ vs. 
[Gi,Bq] for Example 1. The GW-signal model is 
favored when the injected signal has SNR above ~£ 




FIG. 11: Spectrograms showing the glitch injections 
into the three detectors and the geocenter gravitational 
waveform. The spectrograms have been whitened using 
the theoretical LIGO/Virgo noise spectra. The color 
map is on a logarithmic scale, and represents the 
Fourier power. 



stationary, Gaussian samples plus coherent (in time 
and frequency) "glitches" which are modeled by lin- 
ear combinations of wavelets. [No, G\], 

and measured their relative success at modeling differ- 
ent non-Gaussian noise simulations by comparing their 
marginalized likelihoods. 

The most successful of these, [No, G±], used a a trans- 
dimensional Markov chain Monte Carlo algorithm to de- 
termine the most parsimonious number of wavelets used 
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FIG. 12: Spectrograms showing the various components 
that make up the simulated data in the Virgo detector 
for Example 2. The spectrograms have been whitened 
using the theoretical Virgo noise spectrum. The loud, 
low frequency glitch at t ~ 1 second caused havoc for 
noise models that did not include glitch fitting. The 
color map is on a logarithmic scale, and represents the 
Fourier power. 
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GO, BO GO, B1 G1.B0 G1.B1 GO, BO GO, B1 G1.B0 G1.B1 



SNR = 10 SNR =12 
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FIG. 14: Log evidence for each model under 
consideration for Example 2. The evidence is highest 
for the glitch-fitting models. The BH model plus glitch 
model [G\,Bi] is favored for injected SNR's above 8. 
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FIG. 13: Time-to-coalescence search chains with and 
without glitch fitting for Example 2. Without glitch 
fitting the search locks onto a glitch in the Virgo 
detector (injection time ~ 1 s) while with glitch fitting 
the search locks onto the signal injection (t c ~ 6 s). The 
GW signal was injected with SNR of 12. The 
prescription for the glitch fitting is described in Will 



to fit the non-Gaussian constituents of the noise. This 
noise model performed as well or better than the others 
tested for each of the different noise simulations studied 
(see Figure |3| ■ None of these simulations used either of 
the noise models to generate the noise samples. 




S 6 7 8 9 10 11 12 13 

SNR 

FIG. 15: Log Baycs factor for models with glitch fitting, 

showing the odds that a signal is present, [Gi, Bi], 
relative to no signal being present, [Gi,Bq\. The signal 
model is favored for injected SNR's above 8. 



The more interesting application of the noise model- 
ing is to use it in conjunction with a gravitational wave 
search. We included the glitch-fitting (along with other 
improvements) into our parallel tempered Markov chain 
Monte Carlo detection algorithm originally described 
in [23l | . The study was performed using two simulations 
of the interferometer data. One contained Gaussian noise 
plus a single "loud" glitch intentionally overlapping the 
injected GW. The other contained Gaussian noise with 
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FIG. 16: Log Bayes factor for models without glitch 

fitting, showing the odds that a signal is present, 
[Go, Bi], relative to no signal being present, [Go, B ]. 
The signal model is always favored, even when no signal 
is injected. The parameters of the recovered signal do 
not correspond to those injected until the SNR 
reaches ~ 16. 



100 glitches distributed across the network having SNRs 
drawn from a distribution calibrated to match that of 
single detector's triggers in a LIGO S5 black hole ring- 
down search. We find no false positive detections when 
the glitch modeling is included, and are able to locate 
the gravitational wave signal injected into data with SNR 
above ~ 8 despite its overlapping with injected glitches. 
Detections are claimed using Bayesian evidence ratios 
calculated via thermodynamic integration (sec FigslQlfTOl 
HH and OH). 

When the glitch fitting is omitted in favor of the non- 
stationary Gaussian noise model (No, Go) the black hole 
search misses the astrophysical signals and instead at- 
tempts to fit to the injected glitches (see Figs. [7] and 
[T"3f until the injected SNR exceeds ~ 15. These false 
positive scenarios ubiquitously returned very large Bayes 
factors in favor of detection. By using a more realistic 
model for the noise we have mitigated the risk of false- 
positive detections due to non- Gaussian features and suc- 
cessfully extracted GW signals without throwing away 
the "glitchy" data. 

We are encouraged by these demonstrated results and 
are optimistic that this technique can be used as a foun- 
dation for modeling the instrument noise in real data. 
There is, however, still much work to be done. We will 
proceed down two avenues to improve the utility of our 
glitch-fitting detection algorithm. 

The first step is to study the performance of these mod- 
els on actual LIGO/Virgo data in an effort to further 



improve the glitch modeling. In particular, the priors 
we have constructed for the glitch parameters (including 
the number, location in time-frequency space, and am- 
plitudes of the glitches) were uninformed by the ongoing 
LIGO/Virgo detector characterization studies. Folding 
in this information should improve the effectiveness of 
the glitch modeling. 

There is no guarantee that wavelets are the best func- 
tions with which to decompose the glitches - they were 
chosen in part for their convenience, as they form an or- 
thogonal basis with a diagonal noise correlation matrix 
for Gaussian noise. We found that large glitches required 
~ 25 wavelet basis functions, whereas the analyses per- 
formed by Principe and Pinto [HI, [l6| were more efficient 
at matching the features in the data. Adopting a more 
parsimonious basis for fitting the glitches, such as wavelet 
wavepackets [53| , could allow the algorithm to dig deeper 
into the instrument noise and further lower the detection 
thresholds. 

A weakness of our current implementation is the se- 
quential nature of the initial search phase, which starts 
with glitch modeling and is then followed by a search for 
inspiral signals in the cleaned data (the characterization 
phase, by contrast, simultaneously updates the noise and 
signal models). This sequential approach is safe so long as 
the gravitational wave signals are everywhere below the 
instrument noise level, but it would be desirable to per- 
form the glitch removal in way that will not remove loud 
gravitational wave signals. To this end, we are currently 
developing a new version of the algorithm that combines 
the glitch modeling with a search for un- modeled gravi- 
tational wave signals. Wavelets are used to represent the 
two gravitational wave polarizations h + and h x at the 
Geocenter, and these signals are projected onto the detec- 
tor network with appropriate time delays and amplitudes 
corresponding to the proposed sky location. It is possible 
to specify if the signals are un-polarized, or have a partic- 
ular waveform polarization (circular, elliptical, linear). If 
a loud gravitational wave signal is present in the data, it 
is more parsimonious for the signal to be assigned to the 
Geocenter signal wavelets than the individual instrument 
wavelets (for the un-polarized search three detectors are 
required for this to work, while just two detectors are 
needed for the polarized search) . The new algorithm runs 
very quickly since the signals do not have to be wavelet 
transformed at each iteration (the timeshifts are done di- 
rectly in the wavelet domain). While designed to run as 
a burst search, the algorithm could be used for data con- 
ditioning prior to other searches by subtracting the best 
fit glitch model from each detector and leaving behind 
any power assigned to the signal model. 
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Appendix: A Recipe for Parallel Tempering and 
Thermodynamic Integration 

Parallel tempering is an effective way to keep Markov 
chains from locking onto single mode of the target pos- 
terior distribution function, with the added benefit of a 
simple and accurate means of calculating the model's ev- 
idence (thermodynamic integration). Parallel tempering 
involves running Nc chains simultaneously using a mod- 
ified Hasting's ratio 



p(s\0 y ,P) \ p(8 y )g(9 x \9 y ) 

v{s\e x ,p) P x ) q {e y \e x ) 



(27) 



where /3 = l/T is analogous to an inverse "temperature" 
and takes on an increasing value between and 1 for 
each chain. The effect of the exponent on the likelihood 
ratio in Eq. [S7J is to smooth the topography of the tar- 
get distribution, making differences between modes of a 
distribution less dramatic, and broadening the basin of 
attraction for each mode. A chain with a high tempera- 
ture (low /3) will freely move between modes and in the 
limit where T — > oo (j3 — > 0) the likelihood ratio goes to 
1 and the chain will sample the prior distribution (p(9)). 

Hotter chains efficiently sample the full prior volume 
and, in the event that they locate a region of higher pos- 
terior weight, are able to communicate that location to 
the colder chains without violating detailed balance by 
using the "chain swapping" Hasting's ratio: 
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(28) 



for an exchange between chains i and j. Meanwhile, 
the colder chains are more apt to thoroughly explore the 
space around whichever mode it is they sit. It is as if the 
hot chains wildly try different solutions, and pass those 
that offer good fits to the data to the cold chains where 
they can stored in the history of the chain, and refined 
with subsequent iterations. 

Direct exchange of states between two chains is the 
simplest way to couple the chains, but not the most ef- 
ficient. The use of genetic algorithms (GAs) to produce 
new solutions (the offspring) from parallel chains (the 
parents) further increases the benefits of parallel tem- 
pering. 

The post-burn- in samples from all of the parallel chains 
can be used to calculate the evidence for the model under 
consideration by integrating the average log-likelihood 
for each chain over /3 

logp(s\M) = [ (\ogp{s\6,M,p))dp 



P(logp(s\6,M,P))d\ogP. (29) 



By repeating the calculation for different models under 
consideration, one can calculate the Baycs factor (the ra- 
tio of the evidence for two models) or, for a larger number 
of models, the likelihood distribution (in "model space"). 
If the priors on competing models are uniform, the likeli- 
hood distribution is the model posterior up to a normal- 
ization constant. 



1. Heating schemes 

An efficient implementation of parallel tempering re- 
quires the temperature spacing between chains to be 
large enough that each is free to find it's own stationary 
state, but small enough that the chains are still frequently 
exchanging parameters with one another. At low temper- 
ature we tend to err on the side of communication, en- 
suring that the cold chains are efficiently sampling from 
modes of the posterior. This could, depending on the 
model, mandate very close spacing of the chains. On 
the other hand, as the temperature increases the likeli- 
hood ratio becomes increasingly less important in the 
Hasting's ratio, and parameter exchanges become fre- 
quently accepted. In this scenario the chains become 
over-coupled, preventing them from locating a station- 
ary solution. This, in turn, affects the accuracy of the 
evidence integration. For Gaussian likelihoods, geomet- 
ric spacing of chain temperatures is most effective. 

We use two temperature spacings, STh and 5T Cl for 
the hotter and colder chains. The change-over between 
the STs occurrs at some pre-set temperature T*. We 
have typically found that 5T C ~ 1.2—1.5 for the ./V* 
(colder) chains yields adequate mixing. A more sophis- 
ticated choice would be to adjust ST C during the burn-in 
phase until achieving a desired acceptance rate for the 
PT proposals. 

The temperature spacing for the remaining Nc — A* 
(hotter) chains is calculated such that the hottest chain 
has temperature r max with typical values between 10 2 
and 10 4 . We elect to fix the highest temperature, instead 
of the spacing for the hotter chains, so that the evidence 
integrals for different models occur over the same range 
regardless of which ST C or Tj we choose for each model. 

To determine T* we utilize the approximation that (for 
Gaussian posteriors) the effective SNR of the signal seen 
by a tempered chain chain is SNR c ff ~ SNR/VT. Be- 
cause a GW with SNR greater than ~ 10 will not gen- 
erally require careful model-selection (assuming we have 
done an adequate job of modeling the noise) we have used 
T* ~ 10. This corresponds to a maximum effective SNR 
at T* for marginally detectable signals of 10/y/lO ~ 3 
which is sufficiently small to guarantee that hotter chains 
will not "see" the GW signal. With the STs determined, 
the temperature for each chain in the ladder is 
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If locating and/or characterizing the modes of the pos- 
terior, without calculating the evidence, is the goal of the 
analysis the temperature ladder need not extend beyond 
iV*. Hotter chains are primarily sampling from the pri- 
ors and do not noticeably aid in the convergence of the 
colder chains (although they are critical in the evidence 
calculation). In other words, chains that are to be used 
for thermodynamic integration have to go out to much 
larger temperature than those used to produce samples 
the posterior. The exact number of chains, and the max- 
imum temperature needed for both scenarios, are tuned 
on a problem-by-problem basis. Key diagnostics for de- 
termining these settings are discussed later. 



2. Proposal distributions 

There are no rules mandating how often, or between 
which chains, parameter exchanges should be proposed. 
We have typically proposed exchanges only between ad- 
jacent chains, and done so between each chain once 
for every iteration of the MCMC. The parameter-swap 
proposal will first suggest an exchange between chains 
Nc ^ Nc — 1 (the hottest, and next to hottest chains), 
then iVc? — 1 <-> and so on, until a proposed param- 

eter switch between chainsl O- (the next to coldest and 
coldest chains). For model selection problems, when the 
maximum temperature is large, the hotter chains (each 
effectively sampling from the the prior) accept chain- 
swaps with excessively high frequency (> 90%). To 
prevent the hotter chains from over-coupling with their 
neighbors we propose exchanges between chains i and j, 
where Tj < Tj with probability (3j . The coldest chain has 
P = 1 and always attempts parameter exchanges during 
the PTMCMC proposal while hotter chains attempt to 
exchange parameters more rarely 

Proposal distributions within the chains are also in- 
formed by the chain temperature. Hotter chains should 
attempt large jumps in parameter space so they rapidly 
explore the entire prior volume, while colder chains 
should take smaller jumps so they efficiently accept new 
states in the chain. It is natural to scale the jumps by 
the temperature 

(31) 

which is, again, motivated by the assumption that the 
target distribution is Gaussian. For the hotter chains, 
this assumption is no longer valid, and these jumps can 
become large enough that they greatly exceed the prior 
ranges, or are never accepted if the priors are informative 
(i.e., have regions of high probability). Some care needs 
to be taken when scaling proposals by the chain tem- 
perature to avoid quenching the free-range exploration 
of the hottest chains. The hotter chains can frequently 
encounter the bounds of the prior volume. This can en- 
hance the need to carefully deal with jumps that are out- 
side of the prior range (for instance, using jumps along 



eigenvectors of the Fisher matrix when the current solu- 
tion is near the edge of the prior). We have used periodic 
and reflecting boundary conditions on the prior volume. 
Simply rejecting proposals that are out of bounds violates 
the detailed balance condition. 



3. Diagnostics 

In our experience, the most important diagnostic tool 
to use when testing an MCMC algorithm is to verify that 
the target distribution of a T — oo chain is identical to 
the prior distribution. Because you can do this study 
without ever calculating the likelihood (often the most 
computationally demanding part of each iteration) this 
is a quick, invaluable test. 

It is also important to note that burn-in times for hot 
and cold chains can be very different depending on the 
proposal distributions and the size of the prior ranges 
for the model parameters. It is always a good idea, at 
least during the testing of the algorithm, to store each 
temperatures chain. 

Other useful figures of merit are plots of 
(logp(s\8,M,f3)) versus /3. Figures [T7] and [TH] are 
annotated with some of the more valuable diagnostic 
features. In these examples, Mi has a higher dimension 
than Mq and was the more appropriate model. The 
higher dimensional model should normally have equal or 
(more often) higher likelihood at low temperature than a 
model with fewer degrees of freedom. Exceptions to this 
rule are possible if the two models are not nested (i.e., 
one model is not a limiting case of the other). We have 
exclusively used these methods to distinguish between 
nested models (most often the GW detection problem) 
so the following discussion assumes that to be the case. 

The average log-likelihood should increase smoothly 
and monotonically with increasing [3. A jagged (p(s\8)) 
curve is indicative of large errors in the calculation of 
the average log-likelihood, typically associated with poor 
convergence of the chains. Longer run times, more effi- 
cient proposal distributions, or different heating schemes 
can all help improve the convergence rates thus reducing 
the variance of the likelihood chain. 

At low temperature, Mi is able to achieve a better 
fit to the data and therefore returns a higher likelihood. 
This is true even if Mi is not the best description of 
the data due to its additional degrees of freedom. The 
high temperature chains supply the "Occam factor," or 
penalty for carrying that additional flexibility. 

The point at which Mq ~ Mi (the "equilibrium tem- 
perature") indicates when the high-dimensional model 
chain has become sufficiently hot that the "extra" model 
parameters lose touch with whatever feature they had 
been fitting, and begin predominantly sampling the pri- 
ors (see figure [T7|) . At this point it is safe to switch 
to the larger chain spacing, however the location of 
isn't something known a priori. Some educated tun- 
ing/guessing is needed to determine the location of such a 
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Ml has more degrees of freedom 



it low temperature extra 
degrees of freedom 
result in a better fit to the 



The Bayes factor for Models 1 :0 gets a positive contribution from the 
improved fit of Model 1 forT < T* and a negative contribution, 
the "Occam factor," because of the additional degrees of freedom 
forT>T* 



At high temperature extra 
degrees of freedom are 
poorly constrained and are 
able to do "more damage" to 
the overall fit. 
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FIG. 17: An example (log£) vs. /3 plot showing the 
location ol T* and other tips to understanding how 
parallel tempering and thermodynamic integration 
work. In this example the blue curve is for the model 
with a higher number of degrees of freedom, and was 
the favored explanation of the data. 



transition point. We can validate the choice of T* by en- 
suring that it is lower than the equilibrium temperature 
of the chains. 

The temperature range where a model switches from 
fitting to the data and sampling from the priors can re- 
sult in a steep drop in average likelihood, especially for 
high dimensional models. This large transition can be 
difficult to sample with chains automatically, as it can 
occur after the temperature spacing has been increased. 
If it is poorly resolved, the exact interval between chains 
where change occurs is not convergent between trial runs. 
Should this be the case, an additional block of finely sam- 
pled (in temperature) chains should be inserted around 
that region. 

At high temperature, the model with more flexibility is 
no longer constrained by the data, and the extra degrees 
of freedom result in, on average, a much worse fit to the 
data. 

To clarify this, consider the case where the the two 
models under consideration are that of the detection 
problem (for this discussion we will assume stationary 
Gaussian noise and uninformative priors). A noise-only 
model will not be able to accommodate a GW signal in 
the data, while a template will fit to the GW signal (if 
there is one) or to some part of the noise (if there is 
not). Regardless of the data, Mi will return a higher 
likelihood than Mq for low temperature. At high tem- 



perature, Mi is sampling from the priors on the GW 
parameters without being significantly informed by the 
data because the likelihood ratio in the transition proba- 
bility is suppressed by the temperature. Therefore, high 
temperature M\ chains are likely to include in the data a 
bright GW signal which can result in a large % 2 and ac- 
cordingly, a low value for the likelihood. Colloquially, Mi 
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FIG. 18: The same plot as is found in figure [T7] but on a 
log scale in /3. The log- version of these plots is a 
valuable diagnostic for the behavior of the chains at 
high temperature. This also shows the "rule of thumb" 
value for which passes the test of being lower than 
the point where Mi = Mq. 



can do a lot of "damage" to the residual if it is effectively 
unconstrained. 

For thermodynamic integration, the model selection 
question comes down to whether or not the improve- 
ment in likelihood for the colder chains can overwhelm 
the Occam factor supplied by the hotter chains. This 
emphasizes the importance of choosing an appropriate 
T max . The average likelihood stops decreasing once the 
temperature is high enough that additional chains in the 
ladder produce samples from identical distributions (the 
prior). This shows up as a floor in the likelihood versus 
inverse temperature plot best seen in Fig[TSJ Beyond this 
point, contributions to the evidence integral are propor- 
tional to f3 (which is approaching zero) and can safely be 
neglected. 
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